Session 2 | 19/08/2024 Henrique Costa | Métodos Estratégicos em FinQuant
Data and R
Functions
Recipes allow chefs to cook up tasty treats
Recipes call for ingredients
Recipes involve one or more steps
Steps transform ingredients into treats
Functions are like customizable recipes
Functions call for inputs (“arguments”)
Functions involve one or more lines of code
Code transforms inputs into outputs
Using functions requires parentheses (usually)
out <- f(in1, in2)
Functions Live Coding
# USECASE: Function can perform a task more easily and readably
# TEMPLATE: output <- function_name(input)
9^(1/2)
x <- sqrt(9)
x
# ==============================================================================
# LESSON: We can also use functions to transform objects
y <- 9
sqrt(y)
# ==============================================================================
# LESSON: We can even use functions to transform the result of calculations
2/3
round(2/3)
# ==============================================================================
# LESSON: We can customize what a function does using arguments
# TEMPLATE: output <- function_name(argument, argument_name = argument_value)
round(2/3, digits = 2)
round(2/3, digits = 3)
# ==============================================================================
# LESSON: Some arguments are optional because they have default values
round(2/3) # the default value for digits is 0
round(2/3, digits = 0)
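Arguments can be matched by position as well as by name. The sketch below uses only base R's round(); the object names are illustrative and not part of the original script.

```r
# Matched by position: first the number, then the digits
by_position <- round(2/3, 2)

# Matched by name: the same call with the argument spelled out
by_name <- round(2/3, digits = 2)

# Leaving out an optional argument is the same as passing its default value
default_used <- round(2/3)
default_given <- round(2/3, digits = 0)

identical(by_position, by_name)
identical(default_used, default_given)
```

Naming optional arguments is usually the more readable choice, because the reader does not need to remember the order in which the function expects them.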
Vectors
Vectors combine similar objects into a collection
I like to imagine a train pulling multiple cars
A vector is one object with many sub-objects
We refer to each sub-object as an element
Some functions transform each element in turn
Double the amount of cargo in every train car
Some functions summarize across elements
Calculate the total cargo across all train cars
v <- c(1, 2, 3)
Vectors Live Coding
# LESSON: We can combine multiple elements into a vector
# TEMPLATE: vector_name <- c(element1, element2, element3)
x <- 4 9 16 25 # error
x <- c(4, 9, 16, 25)
x
y <- c(2, 3)
y
# ==============================================================================
# LESSON: We can also combine multiple vectors and elements
c(x, y)
c(x, y, 20)
# ==============================================================================
# USECASE: Math operators will transform each element individually
x + 1
x * 3
x # but again, this won't be saved unless you use assignment
# ==============================================================================
# USECASE: Some functions will also transform each element individually
sqrt(x)
log(x)
# ==============================================================================
# USECASE: Other functions will summarize the vector with a single number
length(x)
sum(x)
mean(x)
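The train-car picture can be sketched in a few lines of base R. The cargo values are made up for illustration.

```r
# Hypothetical cargo (in tons) carried by each car of a train
cargo <- c(12, 7, 30)

# Element-wise: double the cargo in every car (one result per car)
doubled <- cargo * 2
doubled

# Summary: total cargo across all cars (a single number)
total <- sum(cargo)
total
```

The element-wise result has the same length as the input vector, while the summary collapses the vector to one value.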
Strings
When talking to R, we need a way to distinguish
Object/function names (e.g., the mean function)
Text/character data (e.g., the word mean)
Strings are R’s way of storing text data
Strings can store any characters (no rules!)
Strings are created and displayed with quotes
R has great tools for working with strings
Strings can be collected into vectors
Special functions can transform strings
name <- "John Doe"
Strings Live Coding
# USECASE: Strings are the main way to store character data in R
my_color <- red # error
my_color <- "red" # correct
# ==============================================================================
# USECASE: Strings can also store symbols not allowed in object names
dye <- "red#40"
dye
dyes <- c("red#40", "blue#02")
dyes
# ==============================================================================
# PITFALL: Many operations you can do to numbers won't work for strings
dyes + 1 # error
mean(dyes) # warning (returns NA)
# ==============================================================================
# USECASE: But other operations work for both or even just for strings
length(dyes)
nchar(dyes)
dyes2 <- toupper(dyes)
dyes2
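As a small illustration of the "special functions" that transform strings, the sketch below combines and slices text with base R's paste(), paste0(), and substr(). These functions are not used in the original script, and the object names are illustrative.

```r
first <- "John"
last <- "Doe"

# paste() joins strings with a space between them by default
full_name <- paste(first, last)
full_name

# substr() extracts characters by position; paste0() joins without a separator
initials <- paste0(substr(first, 1, 1), substr(last, 1, 1))
initials

# nchar() counts every character, including the space
name_length <- nchar(full_name)
name_length
```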
Packages
Cookbooks are a great way to learn to cook
They contain lots of recipes and instructions
Browse an online bookstore for a cookbook
Order it to add it to your personal bookshelf
To use, pull the cookbook off the shelf
Packages are like cookbooks for R
They contain helpful functions and datasets
Browse an online repository for a package
Install it to add it to your personal library
To use, load the package from the library
library("pkg_name")
Packages Live Coding
# USECASE: The stringr package adds a function to fix capitalization
students <- c("mary anne", "BENjamin", "Lee")
# ==============================================================================
# PITFALL: But we can't use that function without installing the package
str_to_title(students) # error
# ==============================================================================
# LESSON: Installing a package using RStudio
# - RStudio > Extras pane > Packages tab > Install button
# ==============================================================================
# PITFALL: We also need to load the package before we can use it
str_to_title(students) # error
# ==============================================================================
# LESSON: We load the package using library()
library("stringr")
str_to_title(students) # finally works!
# ==============================================================================
# LESSON: We can also keep our packages updated using RStudio
# - RStudio > Extras pane > Packages tab > Update button
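The same install-then-load workflow can also be written in code instead of clicking through RStudio. This is a sketch, not the method used in class: install.packages(), requireNamespace(), and the pkg::function() notation are standard R, but they do not appear in the original script.

```r
# Install once per computer (left commented out so the script does not
# reinstall the package every time it runs)
# install.packages("stringr")

students <- c("mary anne", "BENjamin", "Lee")

# requireNamespace() reports whether a package is installed without loading it
has_stringr <- requireNamespace("stringr", quietly = TRUE)

if (has_stringr) {
  # pkg::function() calls one function without loading the whole package
  fixed <- stringr::str_to_title(students)
} else {
  # Package not installed: keep the names unchanged
  fixed <- students
}
fixed
```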
Wrangle I
Tidy Data Principles
There are many ways to store data
We will be learning the tidy data format
Data should be rectangular
Each variable has its own column
Each observation has its own row
Each value has its own cell
Other Data Advice
Name all variables in the first row
This is called a header row
Avoid merged cells for data storage
These are okay for communication
Avoid empty cells whenever possible
Mark missing data as NA
Avoid formatting-as-data for storage
e.g., non-redundant color-coding
Tidying Example 1
Not Tidy
Name    Ann   Bob   Cat   Dom
Age     13    10    11    11
Weight  56.4  46.8  41.3  43.3
❌ Here, each row is a variable and each column is an observation.
Tidy
Name  Age  Weight
Ann   13   56.4
Bob   10   46.8
Cat   11   41.3
Dom   11   43.3
✔️ Here, each column is a variable and each row is an observation.
Tidying Example 2
Not Tidy
Names:  Ann  Bob  Cat  Dom
Age  Weight
13   56.4
10   46.8
11   41.3
11   43.3
❌ Here, we have data that is not rectangular because the Names variable has its own row.
Tidy
Name  Age  Weight
Ann   13   56.4
Bob   10   46.8
Cat   11   41.3
Dom   11   43.3
✔️ Here, we have made the data rectangular by moving the Names variable to its own column.
Tidying Example 3
Not Tidy
country      year  cases / population
Afghanistan  1999  NA / 19987071
             2000  2666 / 20595360
Brazil       1999  37737 / 172006362
             2000  80488 / 174504898
China        1999  212258 / 1272915272
             2000  213766 / 1280428583
❌ Here, we have merged cells and two values stored in a single cell.
Tidy
country      year  cases   population
Afghanistan  1999  NA      19987071
Afghanistan  2000  2666    20595360
Brazil       1999  37737   172006362
Brazil       2000  80488   174504898
China        1999  212258  1272915272
China        2000  213766  1280428583
✔️ Here, we have un-merged the countries and separated the cases and population variables into columns.
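The tidying in Example 3 can be sketched in code. This assumes the tidyr package (part of tidyverse); fill() and separate() are not covered elsewhere in these slides, and the column name rate is a hypothetical stand-in for the "cases / population" column.

```r
library(tidyr)

# The untidy table: blanks left by merged cells are read in as NA,
# and two values share a single cell
untidy <- tibble::tibble(
  country = c("Afghanistan", NA, "Brazil", NA, "China", NA),
  year = c(1999, 2000, 1999, 2000, 1999, 2000),
  rate = c("NA / 19987071", "2666 / 20595360", "37737 / 172006362",
           "80488 / 174504898", "212258 / 1272915272", "213766 / 1280428583")
)

tidy <- untidy |>
  # Copy each country name down into the blank rows below it
  fill(country) |>
  # Split the shared cell into two columns and convert the text to numbers
  separate(rate, into = c("cases", "population"), sep = " / ", convert = TRUE)
tidy
```

With convert = TRUE, the text "NA" becomes a true missing value, so the cases column ends up numeric rather than character.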
Tidying Example 4
Not Tidy
student   grade
Amber     91.5   A-
Bristol   86.2   B
Charlene  94.0   A
Diego     89.3   B+
Legend: Psych. Major, Psych. Minor (shown through cell formatting)
❌ Here, we have a missing variable name and formatting-as-data.
Tidy
student   psych  grade  letter
Amber     major  91.5   A-
Bristol   minor  86.2   B
Charlene  major  94.0   A
Diego     NA     89.3   B+
✔️ Here, we have added a column for the psych variable, removed the legend, and named the letter variable.
Tidying Example 5
Not Tidy
student   grade  letter
Amber     91.5   A-
Bristol*  94.2   A
Class Summary
As        2      Yay!
Bs        0
*Grade was revised.
❌ Here, we have two types of data in one file and a footnote as data.
Tidy
student  grade  letter  revised
Amber    91.5   A-      FALSE
Bristol  94.2   A       TRUE

letter  count  notes
A       2      Yay!
B       0
✔️ Here, we have split the data into two separate tables and added the revised and notes variables.
Long vs. Wide Format
Wide Format
date        Boeing   Amazon   Google
2009-01-01  $173.55  $174.90  $174.34
2009-01-02  $172.61  $171.42  $170.04
✔️ Here, we have a wide format where each observation is a date.
Long Format
date        stock   price
2009-01-01  Boeing  $173.55
2009-01-01  Amazon  $174.90
2009-01-01  Google  $174.34
2009-01-02  Boeing  $172.61
2009-01-02  Amazon  $171.42
2009-01-02  Google  $170.04
✔️ Here, we have a long format where each observation is the combination of a date and a stock.
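Converting between the two formats can be sketched with the tidyr package (part of tidyverse). pivot_longer() and pivot_wider() are not covered elsewhere in these slides, so treat this as an illustration; prices are stored as plain numbers without the dollar sign.

```r
library(tidyr)

# The wide table: one row per date, one column per stock
wide <- tibble::tibble(
  date = c("2009-01-01", "2009-01-02"),
  Boeing = c(173.55, 172.61),
  Amazon = c(174.90, 171.42),
  Google = c(174.34, 170.04)
)

# Wide to long: one row per combination of date and stock
long <- wide |>
  pivot_longer(cols = -date, names_to = "stock", values_to = "price")
long

# Long back to wide: one column per stock again
back_to_wide <- long |>
  pivot_wider(names_from = stock, values_from = price)
back_to_wide
```

The long format tends to be more convenient for grouping and plotting, while the wide format is often easier to read by eye.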
Tibbles
R works particularly well with tidy data
We store tidy data in data frames or tibbles
Tibbles are just fancier data frames (i.e., they have a few extra features)
To use tibbles, we need the tidyverse package
Tibbles are constructed from one or more vectors
The vectors must have the same length
They can contain different types of data
Vectors
We start with three separate vector objects that all have the same length.
We set it up so that the n-th car in each train corresponds to the same observation.
Tibble
Then we combine the vectors into a single tibble (or data frame) object.
Now, as the tibble moves around, the variables always stay together.
Tibbles Live Coding
# SETUP: Install and load the tidyverse package
# Extras pane > Packages tab > Install
library(tidyverse)
# ==============================================================================
# LESSON: Create a tibble from vectors
x <- c(10, 20, 30, 40)
x
y <- x * 2 - 4
y
my_tibble <- tibble(x, y)
my_tibble
# ==============================================================================
# USECASE: You can mix different types of vectors in a single tibble
first_names <- c("Adam", "Billy", "Caitlyn", "Debra")
age_years <- c(12, 13, 10, NA)
guests <- tibble(first_names, age_years)
guests
# ==============================================================================
# TIP: To save time, you can also create the vectors in the tibble call
gradebook <- tibble(
  grade = c(95, 83, 90, 76),
  letter = c("a", "b", "a-", "c")
)
gradebook
# ==============================================================================
# PITFALL: Don't try to combine vectors with different lengths
y <- c(1, 2, 3)
x <- c("a", "b")
tibble(y, x) # error
# ==============================================================================
# LESSON: However, the exception is R will "recycle" a single value
tibble(y, x = "a")
# ==============================================================================
# LESSON: You can "extract" a vector from a tibble using $
mytibble <- tibble(x = c(1, 2, 3, 4, 5), y = "test")
mytibble$x
mytibble$y
# ==============================================================================
# PITFALL: Don't try to extract a vector that doesn't exist
mytibble$z # warning (returns NULL)
Importing and Exporting
Data is usually stored in data files
Importing files into R is called reading
Exporting files from R is called writing
A convenient data file type is a CSV
This stands for comma-separated values
A CSV file is easy to share with other people
The tidyverse package can read/write CSVs
Other packages can read/write other types (e.g., readxl, haven, rio, googlesheets4)
Read/Write Live Coding
# SETUP: Load the tidyverse package (if you haven't yet)
library(tidyverse)
# ==============================================================================
# USECASE: Create a tibble and write it to a file
gradebook <- tibble(
  id = c(123, 456, 789),
  grade = c("A", "B", "A")
)
gradebook
write_csv(gradebook, file = "gradebook.csv")
# NOTE: You can see the new file in Extras pane > Files tab.
# You can open the file in another program (e.g., Microsoft Excel).
# You can also email this file to someone else to share it.
# ==============================================================================
# PITFALL: Don't swap the order of the tibble and the file
write_csv("gradebook.csv", gradebook) # error
# ==============================================================================
# USECASE: Read in a file containing data
old_gradebook <- read_csv("gradebook.csv")
old_gradebook
# NOTE: read_csv() will examine and guess the data type of each variable.
# You can tell it the data type of each variable, but that is more advanced.
# ==============================================================================
# PITFALL: Don't use the read.csv() and write.csv() functions
old_gradebook <- read.csv("gradebook.csv") # not a tibble
old_gradebook
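The note about telling read_csv() the data type of each variable can be sketched as follows. This assumes the readr package (part of tidyverse) and its col_types argument; writing to a temporary file is a choice made here so that the example leaves nothing behind.

```r
library(readr)

gradebook <- tibble::tibble(id = c(123, 456, 789), grade = c("A", "B", "A"))

# Write to a temporary file rather than the project folder
path <- tempfile(fileext = ".csv")
write_csv(gradebook, file = path)

# Default: read_csv() guesses that id is a number
guessed <- read_csv(path, show_col_types = FALSE)

# Explicit: declare id as text, since IDs are labels rather than quantities
declared <- read_csv(
  path,
  col_types = cols(id = col_character(), grade = col_character())
)

class(guessed$id)
class(declared$id)
```

Declaring types is mostly useful when the guess would be wrong, for example with ID codes that have leading zeros.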
Wrangle II
Basic wrangling verbs
tidyverse provides tools for wrangling tibbles
These functions are named after verbs
So if you name your objects after nouns…
…your code becomes easier to read
Noun(noun) ❌        Verb(noun) ✔️
blender(fruit)      blend(fruit)
screwdriver(screw)  drive(screw)
boxcutter(box)      cut(box)
Column-focused verbs
Select retains only certain columns/variables
select(TBL, VAR_KEEP, -VAR_DROP)
Mutate adds or transforms columns/variables
mutate(TBL, NEW_VAR = OLD_VAR / 1000)
Rename changes the names of columns/variables
rename(TBL, NEW_NAME = OLD_NAME)
Relocate changes the order of columns/variables
relocate(TBL, VAR_MOVE, .after = OTHER_VAR)
Select Live Coding
# SETUP: Load package and inspect example tibble
library(tidyverse) # includes the dplyr package
starwars
# ==============================================================================
# USECASE: Retain only the specified variables
sw <- select(starwars, name)
sw
sw <- select(starwars, name, sex, species)
sw
# ==============================================================================
# PITFALL: Don't forget to save the change with assignment
select(starwars, name, sex, species)
starwars # still includes all variables
# ==============================================================================
# USECASE: Retain all variables between two variables
sw <- select(starwars, name, hair_color:eye_color)
sw
# ==============================================================================
# USECASE: Retain all variables except the specified ones
sw <- select(starwars, -sex, -species)
sw
sw <- select(starwars, -c(sex, species))
sw
sw <- select(starwars, -c(hair_color:starships))
sw
# USECASE: Change the name of one or more variables
starwars
sw <- rename(starwars, Character = name)
sw
sw <- rename(starwars, height_cm = height, mass_kg = mass)
sw
# ==============================================================================
# PITFALL: Don't swap the order and try old_name = new_name
sw <- rename(starwars, name = Character) # error
# ==============================================================================
# USECASE: Move variables before or after another variable
starwars
sw <- relocate(starwars, species, sex, .before = height)
sw
sw <- relocate(starwars, species, sex, .after = name)
sw
# ==============================================================================
# PITFALL: Don't forget the period!
sw <- relocate(starwars, sex, before = height)
sw # height was accidentally renamed to before
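Mutate is listed among the column-focused verbs but has no live-coding script of its own, so here is a minimal sketch. It assumes dplyr's built-in starwars tibble; the new column names height_m and bmi are illustrative.

```r
library(dplyr)

sw <- starwars |>
  select(name, height, mass) |>
  mutate(
    height_m = height / 100,  # convert centimeters to meters
    bmi = mass / height_m^2   # a new column can use one created just before it
  )
sw
```

As with the other verbs, the original tibble is unchanged unless the result is saved with assignment.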
Row-focused verbs
Arrange sorts rows based on their values
arrange(TBL, VAR_SORT_UP)
arrange(TBL, desc(VAR_SORT_DOWN))
arrange(TBL, VAR_SORT_1ST, VAR_SORT_2ND)
Filter retains certain rows based on criteria
filter(TBL, DBL_CRIT > 0)
filter(TBL, STR_CRIT == "A")
filter(TBL, CRIT1, CRIT2)
Arrange Live Coding
# USECASE: Sort observations by a variable
starwars
sw <- arrange(starwars, height)
sw # sorted by height, ascending
sw <- arrange(starwars, name)
sw # sorted by name, alphabetically
# ==============================================================================
# USECASE: Sort observations by a variable, in reverse order
sw <- arrange(starwars, desc(height))
sw # sorted by height, descending
sw <- arrange(starwars, desc(name))
sw # sorted by name, reverse-alphabetically
# ==============================================================================
# USECASE: Sort observations by multiple variables
sw <- arrange(starwars, hair_color, mass)
sw # sorted by hair_color, then ties broken by mass
Filter Live Coding
# USECASE: Retain only observations that meet a criterion
sw <- filter(starwars, mass > 100)
sw # only observations with mass greater than 100
sw <- filter(starwars, mass <= 100)
sw # only observations with mass less than or equal to 100
sw <- filter(starwars, species == "Human")
sw # only observations with species equal to Human
sw <- filter(starwars, species != "Human")
sw # only observations with species not equal to Human
# ==============================================================================
# PITFALL: Don't try to use a single = for testing equality
sw <- filter(starwars, height = 150) # error
sw <- filter(starwars, height == 150) # correct
sw
# ==============================================================================
# PITFALL: Don't forget that R is case-sensitive
sw <- filter(starwars, species == "human")
sw # no observations left (because it should be Human)
# ==============================================================================
# USECASE: Retain only observations that meet complex criteria
sw <- filter(starwars, mass > 100 & height > 200)
sw # only observations with mass over 100 AND height over 200
sw <- filter(starwars, height < 100 | hair_color == "none")
sw # only observations with height under 100 OR hair_color equal to none
# ==============================================================================
# PITFALL: Don't forget to complete both conditions
sw <- filter(starwars, mass > 100 & < 200) # error
sw <- filter(starwars, mass > 100 & mass < 200) # correct
sw
# ==============================================================================
# PITFALL: Don't try to equate a string to a vector
sw <- filter(starwars, species == c("Human", "Droid")) # wrong: compares pairwise with recycling
sw <- filter(starwars, species %in% c("Human", "Droid")) # correct
sw
Filter Cheatsheet
Symbol  Description            Num  Chr
<       Less than              Yes  No
<=      Less than or equal to  Yes  No
>       More than              Yes  No
>=      More than or equal to  Yes  No
==      Equal to               Yes  Yes
!=      Not equal to           Yes  Yes
%in%    Found in               Yes  Yes
&       Logical And            Yes  Yes
|       Logical Or             Yes  Yes
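The difference between == and %in% is easiest to see on a short vector. The sketch below uses base R only; the species values are made up for illustration.

```r
species <- c("Droid", "Human", "Human", "Wookiee")

# == compares position by position, recycling the shorter vector:
# "Droid" vs "Human", "Human" vs "Droid", "Human" vs "Human", "Wookiee" vs "Human"
pairwise <- species == c("Human", "Droid")
pairwise

# %in% asks, for each element, whether it appears anywhere in the set
membership <- species %in% c("Human", "Droid")
membership
```

Only %in% answers the question "is this species one of these?", which is why it is the right operator when filtering against several allowed values.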
Wrangle III
Pipes & Pipelines
How can we do multiple operations to an object?
x <- 10
x2 <- sqrt(x)
x3 <- round(x2)
This works but is cumbersome and error-prone
A better approach is to use pipes and pipelines
x3 <- 10 |> sqrt() |> round()
I like to read |> as “and then…”
“Take 10 and then sqrt() and then round()”
Pipes Live Coding
# SETUP: Enable the pipe operator shortcut
# Tools > Global Options... > Code tab > Check "Use Native Pipe Operator"
# Type out |> or press Ctrl+Shift+M (Windows) / Cmd+Shift+M (Mac)
# ==============================================================================
# LESSON: The pipe pushes objects to a function as its first argument
# TEMPLATE: x |> function_name() is the same as function_name(x)
x <- 10
y <- sqrt(x)
y
y <- x |> sqrt()
y
# ==============================================================================
# PITFALL: Don't forget to remove the object from the function call
x |> sqrt(x) # wrong
x |> sqrt() # correct
# ==============================================================================
# USECASE: You can still use arguments when piping
z <- round(3.14, digits = 1)
z
z <- 3.14 |> round(digits = 1)
z
# ==============================================================================
# USECASE: Pipes are useful with tibbles and wrangling verbs
starwars
sw <- select(starwars, name, species, height)
sw
sw <- starwars |> select(name, species, height)
sw
# ==============================================================================
# PITFALL: Don't add a pipe without a step after it
sw <- starwars |> select(name, species, height) |> # error
Pipelines Live Coding
# USECASE: You can chain multiple pipes together to make a pipeline
x <- 10 |> sqrt() |> round()
x
# ==============================================================================
# TIP: If you want to see the output of a pipeline, you can pipe to print()
x <- 10 |> sqrt() |> round() |> print()
# ==============================================================================
# TIP: To make your pipelines more readable, move each step to a new line
x <- 10 |>
  sqrt() |>
  round() |>
  print()
# ==============================================================================
# PITFALL: Don't put the pipe at the beginning of a line, though
x <- 10
  |> sqrt()
  |> round()
  |> print() # error
# ==============================================================================
# USECASE: Chain together a series of verbs to flexibly wrangle data
tallones <- starwars |>
  select(name, species, height) |>
  rename(height_cm = height) |>
  mutate(height_ft = height_cm / 30.48) |>
  filter(height_ft > 7) |>
  arrange(desc(height_ft)) |>
  print()
Factors
Factors are used to represent categorical data
Factors have multiple possible levels
Levels are discrete and mutually-exclusive
Sometimes categories are unordered (nominal)
Action or Comedy or Drama
Asia or Europe or North America
Sometimes categories are ordered (ordinal)
Mild < Medium < Hot
XS < S < M < L < XL
Factors Live Coding
# USECASE: Ask 10 kids to order 1: nuggets, 2: pizza, or 3: salad
food <- c(2, 2, 1, 2, 1, 2, 1, 1, 2, 2)
food
# ==============================================================================
# LESSON: We can turn this vector into a factor with factor()
food2 <- factor(food, levels = c(1, 2, 3))
food2
food3 <- factor(
  food,
  levels = c(1, 2, 3),
  labels = c("nuggets", "pizza", "salad")
)
food3
# ==============================================================================
# USECASE: We can also quickly and easily count each level with table()
table(food3)
# ==============================================================================
# PITFALL: Don't confuse levels and labels
food4 <- factor(
  food,
  labels = c(1, 2, 3),
  levels = c("nuggets", "pizza", "salad")
)
food4 # full of <NA> because it can't find these levels
# ==============================================================================
# USECASE: You can also just enter strings directly (as self-labels)
genre <- c("pop", "metal", "pop", "rock", "rap", "rap", "pop", "rock")
genre
genre2 <- factor(genre) # observed levels will be assigned alphabetically
genre2
table(genre2)
# ==============================================================================
# LESSON: If ordinal, enter levels low-to-high and add ordered = TRUE
salsa <- c("hot", "mild", "medium", "mild", "medium", "medium")
salsa2 <- factor(salsa, levels = c("mild", "medium", "hot"), ordered = TRUE)
salsa2
# NOTE: We may want to visualize or model ordinal factors differently
# ==============================================================================
# USECASE: Working with factors in a tibble
cereal <- read_csv("cereal.csv")
cereal
cereal2 <- mutate(cereal, mfr = factor(mfr), type = factor(type))
cereal2
table(cereal2$mfr)
table(cereal2$type)
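One practical payoff of ordered factors is that comparisons and sorting follow the level order rather than the alphabet. A base R sketch using the clothing sizes mentioned earlier; the data values are made up.

```r
sizes <- factor(
  c("M", "XS", "XL", "S", "M"),
  levels = c("XS", "S", "M", "L", "XL"),
  ordered = TRUE
)

# Comparisons use the level order, so "XL" counts as larger than "M"
at_least_medium <- sizes >= "M"
at_least_medium

# Sorting also follows the level order, not alphabetical order
sorted <- sort(sizes)
sorted

# table() reports every level, including ones that were never observed
counts <- table(sizes)
counts
```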
Missing Values
Sometimes your data will have missing values
Perhaps these were never collected
Perhaps the values were lost/corrupted
Perhaps the participant didn’t respond
We need to tell R which values are missing
To do so, we set those values to NA
Functions from tidyverse make this easy
Missingness is often “contagious” in R (e.g., a vector with NA has an unknown mean)
Missing Values Live Coding
# SETUP: We will need tidyverse for the read and mutate functions
library(tidyverse)
# ==============================================================================
# PITFALL: Number codes for missingness will mess up calculations in R
heights <- c(149, 158, -999) # here we use -999 to represent a missing value
range(heights)
mean(heights)
log(heights) # our missing value is no longer -999
# ==============================================================================
# USECASE: Use NA for missingness instead
heights2 <- c(149, 158, NA)
heights2
log(heights2) # the NA stayed an NA (due to contagiousness)
# ==============================================================================
# LESSON: Use na.rm = TRUE to do a summary function ignoring the NAs
mean(heights2) # the mean is an NA (due to contagiousness)
mean(heights2, na.rm = TRUE)
range(heights2, na.rm = TRUE)
# ==============================================================================
# USECASE: Dealing with missing values in tibbles
cereal <- read_csv("cereal.csv")
cereal$rating
range(cereal$rating)
# ==============================================================================
# LESSON: Use na_if() to convert specific values to NA while mutating
cereal2 <- mutate(cereal, rating = na_if(rating, -999))
cereal2$rating
range(cereal2$rating, na.rm = TRUE)
# ==============================================================================
# LESSON: Use read_csv(na) to convert specific values to NA while reading
cereal3 <- read_csv("cereal.csv", na = "-999")
cereal3$rating
range(cereal3$rating, na.rm = TRUE)
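Before removing missing values it helps to know how many there are. The sketch below uses base R's is.na(), which is not used in the original script; the heights are made up.

```r
heights <- c(149, 158, NA, 163)

# is.na() flags each missing element
is.na(heights)

# Summing the flags counts the missing values (TRUE counts as 1)
n_missing <- sum(is.na(heights))
n_missing

# Comparing with == does not work: the result is itself unknown
heights == NA

# Summaries need na.rm = TRUE to skip the missing value
mean_observed <- mean(heights, na.rm = TRUE)
mean_observed
```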
Wrangle IV
Summarize
Although we store data about many observations…
…we often want to summarize across observations
This is like folding the tibble down to one row
We’ve seen functions that summarize vectors
length(), sum(), min(), max()
mean(), median(), sd(), var()
summarize() lets us use them on tibbles
It works very similarly to mutate()
It always creates a tibble as output
Summarize Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
sales <- tibble(
  customer = c(1, 2, 3, 1, 3),
  store = c("A", "A", "A", "B", "B"),
  items = c(25, 20, 16, 10, 5),
  spent = c(685, 590, 392, 185, 123)
) |>
  print()
# ==============================================================================
# USECASE: Summarize the typical sales
my_summary <- sales |>
  summarize(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |>
  print()
# ==============================================================================
# PITFALL: Don't use summary() instead of summarize()
my_summary <- sales |>
  summary(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |>
  print() # not a tibble
# ==============================================================================
# USECASE: Use more than one summary function
my_summary <- sales |>
  summarize(
    total_items = sum(items),
    total_spent = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |>
  print()
# ==============================================================================
# USECASE: Use counting functions
my_counts <- sales |>
  summarize(
    n_sales = n(),
    n_customers = n_distinct(customer),
    n_stores = n_distinct(store)
  ) |>
  print()
Group Summarize
We can also summarize a tibble by group
This is like folding the tibble multiple times
Specifically, we fold down to one row per group
The syntax for summarize is identical
The only difference is to the tibble
We first pass it through group_by()
Pipelines make this very easy
Group Summarize Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
sales <- tibble(
  customer = c(1, 2, 3, 1, 3),
  store = c("A", "A", "A", "B", "B"),
  items = c(25, 20, 16, 10, 5),
  spent = c(685, 590, 392, 185, 123)
) |>
  print()
# ==============================================================================
# LESSON: We pass a tibble through group_by to group it
sales
sales |> group_by(store) # note the display says "grouped"
# ==============================================================================
# USECASE: We can then summarize and get stats per group
sales |>
  group_by(store) |>
  summarize(
    customers = n_distinct(customer),
    items_sold = sum(items),
    total_sales = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  )
# ==============================================================================
# SETUP: Let's get a larger, more realistic dataset
# Extras pane > Packages tab > Install > nycflights13
library("nycflights13")
flights
# ==============================================================================
# USECASE: Find the carrier with the lowest average delays
flights |>
  group_by(carrier) |>
  summarize(m_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(m_delay)
# ==============================================================================
# LESSON: We can also group by multiple variables
# USECASE: Let's find the day of the year with the most flights
flights |>
  group_by(month, day) |>
  summarize(n_flights = n()) |>
  arrange(desc(n_flights))
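When grouping by more than one variable, summarize() by default keeps the result grouped by all but the last variable and prints a message saying so. A sketch of dplyr's .groups argument, which is not used in the original script, shows how to get a plain ungrouped tibble back; it reuses the sales data from the script.

```r
library(dplyr)

sales <- tibble(
  customer = c(1, 2, 3, 1, 3),
  store = c("A", "A", "A", "B", "B"),
  items = c(25, 20, 16, 10, 5),
  spent = c(685, 590, 392, 185, 123)
)

# Default: the result is still grouped by store (the first grouping variable)
by_default <- sales |>
  group_by(store, customer) |>
  summarize(total_spent = sum(spent))
group_vars(by_default)

# .groups = "drop" returns an ungrouped tibble, so later steps
# operate on the whole table
dropped <- sales |>
  group_by(store, customer) |>
  summarize(total_spent = sum(spent), .groups = "drop")
group_vars(dropped)
```

Leftover grouping is a common source of surprises, because a later mutate() or arrange() step may then behave per group.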
Visualize I
What is a graphic?
A data visualization expresses data through visual aesthetics.
Describing Graphics
Some simple graphics are easy to describe and may even have ready names.
Describing Graphics
A grammar of graphics will help us describe more complex graphics.
The Grammar of Graphics
The grammar of graphics is a set of rules for describing and creating data visualizations
To make our data visual (and therefore put our highly evolved occipital lobes to work)…
We connect variables to visual qualities
We represent observations as visual objects
This requires some fundamental elements
We will first learn about them in lecture
We will then apply them in R using {ggplot2}
Data
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
Graphics require data (e.g., tibbles), which describe observations using variables.
Aesthetic Mappings
Graphics require aesthetic mappings, which connect data variables to visual qualities.
Scales
Graphics require scales, which connect specific data values to specific aesthetic values.
Geometric Objects
Graphics require geometric objects (geoms), which represent the observations.
ggplot2 Basics
The ggplot2 package is a part of tidyverse
No need to install or load it separately
It plays nicely with tibbles and wrangling
It implements the grammar of graphics in R
The “gg” stands for “grammar of graphics”
Thus, we will need to provide all four elements
We will create a pseudo-pipeline of commands
However, we will use + rather than |>
This is because {ggplot2} predates the R pipe
ggplot2 Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg
# ==============================================================================
# LESSON: First, set the data to a tibble
p <- ggplot(data = mpg)
p
# ==============================================================================
# LESSON: Next, set the aesthetic mappings with aes()
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
p
# ==============================================================================
# TIP: You can leave off the optional argument names
p <- ggplot(mpg, aes(x = displ, y = hwy))
p
# ==============================================================================
# LESSON: Next, set the positional scales
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)",
    limits = c(1, 7),
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  )
p
# ==============================================================================
# LESSON: Finally, add a point geom
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)",
    limits = c(1, 7),
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  ) +
  geom_point()
p
# ==============================================================================
# TIP: If you leave off the scales, R will try to guess
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p
# ==============================================================================
# LESSON: We can also customize the geom with arguments
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red", shape = "square", size = 2)
p
Basic Layering
ggplot2 uses a layered grammar of graphics
We can keep stacking geoms on top
Layering adds a lot of possibilities
We can convey more complex ideas
We can learn more about our data
But we can still describe these graphics
Just describe each layer in turn
And describe the layers’ ordering
Basic Layering Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg
# ==============================================================================
# USECASE: Add a smooth geom (i.e., line of best fit)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")
# ==============================================================================
# USECASE: Add a line geom (i.e., connecting points)
economics
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_point()
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_point() +
  geom_line(color = "orange", size = 1)
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "orange", size = 1) +
  geom_point()
# ==============================================================================
# USECASE: Add reference line geoms
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_hline(yintercept = 0, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_vline(xintercept = 7.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_abline(intercept = 4000, slope = 0.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()
Working with Color
Color scales come in two main types:
Discrete scales have separate colors
Best with factor variables
Continuous scales form a gradient
Best with numeric variables
There are two ways to control color:
You can map color to a variable
It will take on different values
You can set color to a value
It will take on one value only
Color Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg

# ==============================================================================
# USECASE: Continuous color scales work well with numeric variables
ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4)

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4) +
  scale_color_continuous(type = "viridis")

# ==============================================================================
# USECASE: Use a discrete color scale with categorical variables
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_discrete(
    name = "Drivetrain",
    breaks = c("4", "f", "r"),
    labels = c("Four Wheel", "Front Wheel", "Rear Wheel")
  )

# ==============================================================================
# PITFALL: Don't forget to set categorical variables as factors
ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point() # R guesses you want a continuous scale

ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  scale_color_discrete(name = "Cylinders")

# ==============================================================================
# LESSON: Set a geom's color aesthetic to make it always that color
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")

# ==============================================================================
# PITFALL: However, do this inside of geom() not aes()
ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point() # unintended: "blue" is treated as a data value, not a color

# ==============================================================================
# LESSON: If you both set and map color, the setting will win
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(color = "blue")
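The difference between mapping and setting color can also be checked without looking at the plots. The sketch below uses ggplot2's layer_data() to pull out the colors that the point layer would actually draw. drawn_colors() is a hypothetical helper name introduced for this illustration, and ggplot2 is loaded on its own because mpg ships with it.

```r
library(ggplot2)

# Mapped: color depends on the drv variable, so it goes inside aes()
mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Set: color is one fixed value, so it goes inside the geom
fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Pitfall: "blue" inside aes() is treated as data, not as a color name
pitfall <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

# Both: the value set in the geom overrides the mapping
both <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(color = "blue")

# Hypothetical helper: the distinct colors drawn by the first layer
drawn_colors <- function(plot) unique(layer_data(plot)$colour)

length(drawn_colors(mapped)) # one color per drivetrain category
drawn_colors(fixed)          # "blue"
drawn_colors(pitfall)        # a single default palette color, not "blue"
drawn_colors(both)           # "blue"
```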
# SETUP: We will need tidyverse and an example graphic
library(tidyverse)
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Fuel Efficiency")
p

# ==============================================================================
# USECASE: Apply a "complete" theme
p + theme_bw()
p + theme_classic()
p + theme_dark()

# ==============================================================================
# LESSON: For more precise control, we can use theme()
p + theme(legend.position = "top")
p + theme(plot.title = element_text(color = "purple", face = "bold"))
p + theme(panel.grid = element_blank())

# NOTE: There are a lot of elements to learn, so use a cheatsheet!
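Complete themes and theme() tweaks can be combined, and the order matters. The sketch below is an illustration that goes slightly beyond the script above: it assumes the plot object exposes its theme settings as p$theme, and it compares the stored legend position when the tweak comes after versus before theme_bw().

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Fuel Efficiency")

# Tweak after the complete theme: the legend moves to the top
tweak_last <- p + theme_bw() + theme(legend.position = "top")

# Tweak before the complete theme: theme_bw() replaces the earlier setting
tweak_first <- p + theme(legend.position = "top") + theme_bw()

tweak_last$theme$legend.position  # "top"
tweak_first$theme$legend.position # "right"
```

A safe habit is to apply the complete theme first and add theme() adjustments afterwards.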
Exporting Graphics
We may need to export graphics from R
e.g., for a paper, poster, or presentation
This job is handled fantastically by ggsave()
We can create many types of files
We can customize the exact size
I recommend .png for most daily purposes
For publishing, I prefer .pdf or .svg
They retain perfect quality at any zoom
You can send these files to most publishers
Exporting Live Coding
# SETUP: We will need tidyverse and an example graphic
library(tidyverse)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Engine Displacement", y = "Highway MPG")
p

# ==============================================================================
# USECASE: Save a specific ggplot object to a file
ggsave(filename = "pfinal.png", plot = p)

# ==============================================================================
# LESSON: Specify the size of the file to create
ggsave(filename = "pfinal2.png", plot = p, width = 6, height = 3, units = "in")

# ==============================================================================
# LESSON: Just change the extension to create a different file type
ggsave(filename = "pfinal2.pdf", plot = p, width = 6, height = 3, units = "in")

# ==============================================================================
# PITFALL: Creating a very large file may lead to small text
ggsave(filename = "p_poster.png", plot = p, width = 12, height = 8, units = "in")

# ==============================================================================
# TIP: You can quickly increase the text size using base_size
p2 <- p + theme_grey(base_size = 24)
ggsave(
  filename = "p_poster2.png", plot = p2,
  width = 12, height = 8, units = "in"
)
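When experimenting with export settings, it can help to write into a temporary folder and confirm what was created. The sketch below is one possible way to do that: the file names are made up for the example, and it assumes a working PNG graphics device is available on your machine.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine Displacement", y = "Highway MPG")

# Write into a temporary folder so nothing lands in the working directory
out_dir <- tempdir()
paths <- file.path(out_dir, c("pfinal_demo.png", "pfinal_demo.pdf"))

# Same plot and size each time: ggsave() picks the file type from the extension
for (path in paths) {
  ggsave(filename = path, plot = p, width = 6, height = 3, units = "in")
}

# Confirm that both files were written and compare their sizes on disk
data.frame(
  file = basename(paths),
  exists = file.exists(paths),
  bytes = file.size(paths)
)
```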